1. INTRODUCTION

In the big data era, the analysis of large amounts of data has become an essential area of Computer Science.
Data mining methods, such as clustering and classification, are needed to extract knowledge from large
amounts of data. Grouping data according to similarities across multiple characteristics is known as
clustering [1].

In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation
(AP), proposed by Brendan J. Frey and Delbert Dueck (2007) [2]. Unlike earlier clustering methods such as
k-means, which take random data points as initial potential exemplars, AP considers all data points as
potential cluster centers [3, 4]. AP takes as input the similarities between data points and simultaneously
considers all data points as potential cluster centers, called exemplars, by iteratively calculating the
responsibility r and the availability a from the similarities until convergence. After convergence, AP finds
clusters with far lower error than k-means, and it does so in less than one hundredth of the time [3].

AP has several advantages over other clustering methods because it considers all data points as potential
exemplars, whereas most clustering methods find exemplars by keeping track of a fixed set of data points and
iteratively refining it [3]. As a result, most clustering methods do not change that set much and simply keep
tracking it. Furthermore, AP supports non-symmetric similarities and does not depend on the initialization
required by other clustering algorithms. Because of these advantages, it has been applied successfully in
many disciplines, such as image clustering [3] and Chinese calligraphy clustering [5].

This paper continues our previous works [6, 7]. In [6] we surveyed and investigated the performance of
various AP approaches, i.e. Adaptive AP, Partition AP, Soft Constraint AP, and Fuzzy Statistic AP. We found
that i) Partition AP (PAP) is the fastest of the four approaches, ii) Adaptive AP (AAP) can remove
oscillations and is more stable than the others, and iii) Fuzzy Statistic AP (FSAP) can yield a smaller
number of clusters than the other approaches, because its preferences are generated iteratively using
fuzzy-statistical methods. In [7] we investigated the time complexity of various AP approaches, i.e. AAP,
PAP, Landmark AP (L-AP), and K-AP. We found that Landmark AP has the most efficient computational cost and
the fastest running time, although the number of clusters it produces differs considerably from that of AP.

Although AP has been shown to be faster than k-means and has been successful in data clustering, it still
has a limitation: it is not easy to determine the value of the parameter "preference" p that yields an
optimal clustering solution [8]. The goal of this paper is to resolve this limitation by proposing a new
model of the parameter "preference" p, based on the distribution of similarities. The model is explained in
subsection 3.1. It is then applied to Adaptive AP (AAP), because AAP has a better level of accuracy than the
other approaches.

The experiments use a random non-partition dataset, a random partition dataset, and real datasets from the
UCI repository [9]. For the partition dataset, the k-means algorithm [10] is applied to generate four groups
of data points. The results of our experiments are shown in section 4.


2. THEORETICAL BACKGROUND 
2.1. Affinity Propagation 


2.1.1. Input Preference 


Suppose we have a set of data points X = {x_1, x_2, x_3, ..., x_n}. AP takes as input a similarity matrix
(SM) s between data points, where each similarity s(i, j) indicates how well data point x_j is suited to be
an exemplar for x_i. Similarities of any type can be accepted, e.g. the negative Euclidean distance for
real-valued data and the Jaccard coefficient for non-metric data, so AP can be applied widely in different
areas [7].

Instead of requiring that the number of clusters be predetermined, AP takes as input a real number s(j, j)
for each data point j, so that data points with larger values of s(j, j) are more likely to be chosen as
exemplars. These values are referred to as "preferences". The preferences affect the number of clusters
produced. The preference value can be the median of the similarities or their minimum:


p = median(s(:)), or, p = min(s(:)) (1) 
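
As a minimal sketch of Eq. 1, the snippet below builds a similarity matrix and fills its diagonal with a shared preference. It assumes the negative squared Euclidean distance as the similarity and takes the median or minimum over the off-diagonal entries only; the function names `similarity_matrix` and `set_preference` are hypothetical and not taken from the paper.

```python
from statistics import median


def similarity_matrix(points):
    """Negative squared Euclidean distance between every pair of points."""
    n = len(points)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s[i][j] = -sum((u - v) ** 2 for u, v in zip(points[i], points[j]))
    return s


def set_preference(s, rule="median"):
    """Write a shared preference p onto the diagonal s(j, j), as in Eq. 1.

    Assumption: p is taken over off-diagonal similarities only.
    """
    n = len(s)
    off_diagonal = [s[i][j] for i in range(n) for j in range(n) if i != j]
    p = median(off_diagonal) if rule == "median" else min(off_diagonal)
    for j in range(n):
        s[j][j] = p
    return p


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
s = similarity_matrix(points)
p_median = set_preference(s, "median")  # larger p, tends to give more clusters
p_min = set_preference(s, "min")        # smaller p, tends to give fewer clusters
```

Because a larger s(j, j) makes point j more likely to become an exemplar, the median rule usually yields more clusters than the minimum rule.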


2.1.2. Message Passing


Suppose we have similarities s(i, j), (i, j = 1, 2, ..., n). AP attempts to find the exemplars that maximize
the net similarity, i.e. the overall sum of similarities between all exemplars and their member data points.
The process in AP can be viewed as passing values between data points. Two kinds of values are passed:
responsibility and availability. Responsibility r(i, j) shows how well suited point j is to serve as the
exemplar for point i, taking into account other potential exemplars for point i. Availability a(i, j)
reflects the accumulated evidence for how appropriate it would be for point i to choose point j as its
exemplar, taking into account the support from other points that point j should be an exemplar.

Figure 1 shows how availability and responsibility work in AP. Availabilities a(i, j) are sent from
candidate exemplars to data points and indicate how available each candidate exemplar is to serve as a
cluster center for each data point. Responsibilities r(i, j) are sent from data points to candidate
exemplars and indicate how strongly each data point favors the candidate exemplar over other candidate
exemplars. Message passing continues until convergence is reached or the number of iterations reaches a
given limit.

Initially all r(i, j) and a(i, j) are set to 0, and their values are then updated iteratively as follows
until convergence is achieved:


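$$ r(i, j) = \begin{cases} s(i, j) - \max_{k \neq j} \{ a(i, k) + s(i, k) \} & (i \neq j) \\ s(i, j) - \max_{k \neq j} \{ s(i, k) \} & (i = j) \end{cases} \tag{2} $$

$$ a(i, j) = \begin{cases} \min \{ 0, \; r(j, j) + \sum_{k \neq i, j} \max \{ 0, r(k, j) \} \} & (i \neq j) \\ \sum_{k \neq j} \max \{ 0, r(k, j) \} & (i = j) \end{cases} \tag{3} $$

Figure 1. Message Passing in Affinity Propagation [3]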


After both the responsibilities and the availabilities have been calculated, they are damped in each
iteration as follows until convergence is achieved:

$$ r(i, j) = (1 - \lambda) \, r(i, j) + \lambda \, r_0(i, j) \tag{4} $$

$$ a(i, j) = (1 - \lambda) \, a(i, j) + \lambda \, a_0(i, j) \tag{5} $$

where λ is a damping factor introduced to reduce numerical oscillations, and r_0(i, j) and a_0(i, j) are the
previous values of the responsibility and availability. λ should be greater than or equal to 0.5 and less
than 1, i.e. 0.5 ≤ λ < 1. A high value of λ may avoid numerical oscillations, but this is not guaranteed,
and a low value of λ will make AP run slowly [11].
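
The update rules in Eqs. 2 to 5 can be sketched as a single damped iteration. This is an illustrative sketch rather than the authors' MATLAB implementation: the function name `ap_iteration` is hypothetical, the matrices are plain nested lists, and the availabilities are assumed to be computed from the already damped responsibilities.

```python
def ap_iteration(s, r, a, lam=0.5):
    """Run one damped AP update (Eqs. 2-5) on n x n nested lists of floats."""
    n = len(s)
    idx = range(n)

    # Eq. 2: responsibilities. Following the text, the availability term is
    # left out of the competing maximum when i == j.
    raw_r = [[0.0] * n for _ in idx]
    for i in idx:
        for j in idx:
            if i != j:
                rival = max(a[i][k] + s[i][k] for k in idx if k != j)
            else:
                rival = max(s[i][k] for k in idx if k != j)
            raw_r[i][j] = s[i][j] - rival
    # Eq. 4: blend the fresh value with the previous one (weight lam on the old).
    r = [[(1 - lam) * raw_r[i][j] + lam * r[i][j] for j in idx] for i in idx]

    # Eq. 3: availabilities, computed here from the damped responsibilities.
    raw_a = [[0.0] * n for _ in idx]
    for i in idx:
        for j in idx:
            if i != j:
                support = sum(max(0.0, r[k][j]) for k in idx if k not in (i, j))
                raw_a[i][j] = min(0.0, r[j][j] + support)
            else:
                raw_a[i][j] = sum(max(0.0, r[k][j]) for k in idx if k != j)
    # Eq. 5: damp the availabilities in the same way.
    a = [[(1 - lam) * raw_a[i][j] + lam * a[i][j] for j in idx] for i in idx]
    return r, a


# Toy similarities for three points; the diagonal holds the preference p = -9.
s = [[-9.0, -1.0, -9.0],
     [-1.0, -9.0, -10.0],
     [-9.0, -10.0, -9.0]]
zeros = [[0.0] * 3 for _ in range(3)]
r, a = ap_iteration(s, zeros, zeros, lam=0.5)
```

In practice the function would be called repeatedly, feeding the returned r and a back in, until the chosen exemplars stop changing or an iteration limit is reached.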


2.1.3. Exemplar decision 


For a data point x_i, if x_j attains the maximum of r(i, j) + a(i, j), then it can be deduced that i) x_j is
the most suitable exemplar for x_i, or that ii) x_i is itself an exemplar when j = i. The exemplar for x_i
is selected by the following formula:

$$ c_i \leftarrow \arg\max_{j} \{ r(i, j) + a(i, j) \} \tag{6} $$

where c_i is the exemplar for x_i.
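
The exemplar decision of Eq. 6 amounts to a row-wise argmax, sketched below. The matrices r and a hold made-up values chosen only for illustration, and the function name `choose_exemplars` is hypothetical.

```python
def choose_exemplars(r, a):
    """Return c, where c[i] is the index j maximizing r(i, j) + a(i, j) (Eq. 6)."""
    n = len(r)
    return [max(range(n), key=lambda j: r[i][j] + a[i][j]) for i in range(n)]


# Hypothetical message values for three data points.
r = [[1.0, -2.0, -3.0],
     [2.0, -1.0, -4.0],
     [-5.0, -6.0, 0.5]]
a = [[0.5, -1.0, 0.0],
     [0.0, 0.0, -1.0],
     [-1.0, 0.0, 0.2]]

c = choose_exemplars(r, a)
# A point i with c[i] == i is itself an exemplar.
exemplars = sorted({i for i, ci in enumerate(c) if ci == i})
```

With these values, points 0 and 2 choose themselves and point 1 chooses point 0, so the data split into two clusters.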


2.2. Adaptive Affinity Propagation 


AP has many extensions. One of them is Adaptive Affinity Propagation (AAP). AAP is designed to address two
limitations of AP: we cannot know in advance which preference value yields the best clustering solution, and
if oscillations occur, they cannot be eliminated automatically. To solve these problems, AAP adapts to the
data set by adjusting the values of the preferences and the damping factor.

As in [11], we assume that C(i) is the number of clusters at iteration i. We summarize the AAP algorithm as
follows:


Algorithm 1 Adaptive Affinity Propagation (AAP)

Input : Data points x_i, i = 1, 2, ..., n
Output : Centers of clusters C(i)

1: Set parameter λ = 0.5;
2: Calculate s(i, j);
3: Set r(i, j) = 0 and a(i, j) = 0;
4: Run AP steps (Eq. 2 - Eq. 6);
5: Verify whether oscillations occur. If oscillations occur, then λ ← λ + λ_step; otherwise continue running
   the AP steps.
6: If C(t) < C(t + 1), then p ← p + p_step and s(i, i) ← p. Go to step 4.


As proposed in [11], the AAP algorithm stops when the value of C(i) is lower than 2.
AAP has shown clustering quality better than, or at least equal to, that of AP, and it finds optimal
solutions on different kinds of data sets [11]. AAP has been shown to be able to process several types of
data, such as gene expression [11], travel routes [11], image clustering [11, 12], a mixed numerical and
categorical dataset [13], text documents [14], and zoogeographical regions [15].


3. PROPOSED WORK 
3.1. Proposed Preference Model 


From a set of data points X = {x_1, x_2, ..., x_n}, suppose we randomly pick two data points x_i and x_j. If
the distance from x_i to the other points is larger than the distance from x_j to the other points, then x_i
is less likely than x_j to be a center of the dataset. On the basis of this assumption, the preference for
each data point can be computed as follows.

For a given data point x_i, the similarities from x_i to the other data points are summed up as:

$$ TDS(i) = \sum_{j = 1, \, j \neq i}^{n} s(i, j) \tag{7} $$

Then it is normalized as:

$$ NTDS(i) = \frac{TDS(i)}{\sum_{k=1}^{n} TDS(k)} \tag{8} $$

Finally, the preference for each data point can be defined as follows:

$$ p(i) = s(i, i) = N \times NTDS(i) \times Const \tag{9} $$

where N is the number of data points and Const can be a non-zero real number or min(s(:)).
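
The preference model of Eqs. 7 to 9 can be sketched in a few lines. This sketch assumes that the normalization in Eq. 8 divides by the sum of all TDS values and uses Const = min(s(:)), one of the two options named above; the function name `proposed_preferences` is hypothetical.

```python
def proposed_preferences(s, const):
    """Per-point preferences p(i) = N * NTDS(i) * Const (Eqs. 7-9)."""
    n = len(s)
    # Eq. 7: total similarity from point i to every other point.
    tds = [sum(s[i][j] for j in range(n) if j != i) for i in range(n)]
    # Eq. 8: normalize so that the NTDS values sum to 1 (assumed form).
    total = sum(tds)
    ntds = [t / total for t in tds]
    # Eq. 9: scale by the number of points and the constant.
    return [n * w * const for w in ntds]


# Negative squared Euclidean similarities for (0, 0), (0, 1), (3, 0);
# the diagonal is unused by Eq. 7 and left at 0 here.
s = [[0.0, -1.0, -9.0],
     [-1.0, 0.0, -10.0],
     [-9.0, -10.0, 0.0]]
const = min(s[i][j] for i in range(3) for j in range(3) if i != j)
p = proposed_preferences(s, const)
```

In this toy case the isolated third point receives the lowest preference, which matches the assumption that points far from the rest are less likely to be centers.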

The preferences in Equation (9) reflect the distribution of the data set, and we expect this to lead to
better results, as shown in the results section. This model is then applied in the Modified Adaptive
Affinity Propagation algorithm (MAAP), as explained in subsection 3.2. Our model is simple and easy to apply
compared with another model proposed by Ping Li et al. (2017) [16].


3.2. Modified Adaptive Affinity Propagation (MAAP) 


We modify adaptive AP in the following manner 


Algorithm 2 Modified Adaptive Affinity Propagation (MAAP)

Input : Data points x_i, i = 1, 2, ..., n
Output : Centers of clusters C(i)

1: Set parameter λ = 0.5;
2: Calculate s(i, j), and set p as in Eq. 9;
3: Set r(i, j) = 0 and a(i, j) = 0;
4: Run AP steps (Eq. 2 - Eq. 6);
5: Verify whether oscillations occur. If oscillations occur, then λ ← λ + λ_step; otherwise continue running
   the AP steps.


Because the proposed preferences p are set in the algorithm, step 6 of Algorithm 1 can be omitted.


3.3. Cluster Evaluation 


The Silhouette validation index and the Fowlkes-Mallows index [17] are used to evaluate the quality of a
clustering result.
For a given dataset X = {x_1, x_2, ..., x_n}, x_i ∈ R^m, the Silhouette index of x_i is defined as

$$ Sil(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))} \tag{10} $$

where a(x_i) is the mean distance from x_i to the other data points in the same cluster K_c, and
d(x_i, K_{c'}) is the mean distance from x_i to all data points in cluster K_{c'} (c' ≠ c). b(x_i) is
defined as

$$ b(x_i) = \min_{c'} d(x_i, K_{c'}), \quad c' = 1, 2, \ldots, C, \; c' \neq c, \tag{11} $$

where C is the number of clusters.

For a given dataset X = {x_1, x_2, ..., x_n}, the Silhouette index of the whole dataset can be expressed as
follows:

$$ Sil(X) = \frac{1}{n} \sum_{i=1}^{n} Sil(x_i) \tag{12} $$
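
As a rough illustration of Eqs. 10 to 12, the sketch below computes the Silhouette index for a small labeled point set. It assumes Euclidean distance and at least two clusters, and it assigns a score of 0 to a point that is alone in its cluster, a common convention that the text does not specify. The function name `silhouette` is hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Mean Silhouette index Sil(X) over all points (Eqs. 10-12)."""
    n = len(points)
    clusters = sorted(set(labels))

    def mean_dist(i, cluster):
        # Mean distance from point i to the members of `cluster`, excluding i.
        members = [k for k in range(n) if labels[k] == cluster and k != i]
        return sum(dist(points[i], points[k]) for k in members) / len(members)

    scores = []
    for i in range(n):
        own = labels[i]
        if labels.count(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a_i = mean_dist(i, own)                                   # a(x_i)
        b_i = min(mean_dist(i, c) for c in clusters if c != own)  # Eq. 11
        scores.append((b_i - a_i) / max(a_i, b_i))                # Eq. 10
    return sum(scores) / n                                        # Eq. 12


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette(points, labels)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.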
The Fowlkes-Mallows Index (FMI) is an external criterion [17], defined as

$$ FMI = \sqrt{ \frac{a}{a + b} \times \frac{a}{a + c} } \tag{13} $$

where a is the number of pairs of data points with the same label that are assigned to the same cluster, b
is the number of pairs with the same label that are assigned to different clusters, and c is the number of
pairs with different labels that are assigned to the same cluster.
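
Eq. 13 can be sketched by counting pairs of data points directly. The snippet assumes that a, b, and c count pairs of points, as in the usual definition of the index, and it returns 0 when no pair agrees; the function name `fowlkes_mallows` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(true_labels, cluster_labels):
    """Fowlkes-Mallows index from pair counts a, b, c (Eq. 13)."""
    a = b = c = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_label = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_label and same_cluster:
            a += 1  # same label, same cluster
        elif same_label:
            b += 1  # same label, different clusters
        elif same_cluster:
            c += 1  # different labels, same cluster
    if a == 0:
        return 0.0  # assumed value when no pair agrees
    return sqrt((a / (a + b)) * (a / (a + c)))


true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]
fmi = fowlkes_mallows(true_labels, cluster_labels)
```

The index depends only on which points are grouped together, so renaming the cluster labels does not change its value.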


4. RESULTS AND DISCUSSION 
The MAAP algorithm was written and run in MATLAB R2014b. The tests were carried out on a machine with 4 GB
of RAM and an Intel(R) Core(TM) i5-2670QM 2.20 GHz processor. We tested the MAAP algorithm with:


• two-dimensional random non-partition data point sets of size 100, 500, 1000, 1500, and 2000, to observe
how the algorithm scales;


• two-dimensional random partition data point sets of size 100, 300, 500, 800, and 1000. The random
non-partition data points are generated from a uniform distribution from 0 to 1. The random partition data
points are divided into four groups generated using the k-means algorithm;


• real datasets, as shown in Table 1. These datasets can be downloaded from the UCI repository [9].


Table 1. UCI Datasets

Dataset       True Clusters   Number of Samples   Dimension
4k2_far       4               400                 2
Ionosphere    2               351                 4
Iris          3               150                 4
Wine          3               178                 13


For the similarity, we use the negative squared Euclidean distance between data points: for points x_i and
x_j,

$$ s(i, j) = - \| x_i - x_j \|^2 . $$


4.1. Experiments on Random Non-Partition Dataset 


The clustering results on the random non-partition dataset are presented in Table 2 and Table 3, and Figure
2 shows an example result with N = 1000 data points. These tables and Figure 2 show that, although the
number of clusters NC produced by the MAAP-DDP algorithm is almost the same as that produced by the original
AP with p = min(s), the MAAP-DDP algorithm is slower than the original AP with both p = median(s) and
p = min(s). This makes sense, because the MAAP-DDP algorithm adaptively searches for the λ value in order to
eliminate oscillations.

For the various values of N, the Silhouette index (Sil) of both algorithms is similar, in the range 0.3 to
0.45. Interestingly, the Sil value of MAAP-DDP is more constant, at around 0.325. This suggests that the
clusters produced by MAAP-DDP are more stable than those produced by the original AP. The FMI cannot be
calculated because the random non-partition dataset does not have true labels.


4.2. Experiments on Random Partition Dataset 


The clustering results on the random 4-partition dataset are presented in Table 4 and Table 5, and Figure 3
shows an example result with N = 1000 data points. These tables and Figure 3 show that MAAP-DDP succeeds in
identifying clusters in agreement with the number of true labels in the dataset. The number of true labels
is 4, and the number of clusters (NC) produced by MAAP-DDP is also 4 for all values of N.

The speed of MAAP-DDP is comparable with that of the original AP, i.e. the execution time of MAAP-DDP is not
slower than that of the original AP. Although the Sil values of MAAP-DDP are smaller than
Table 2. AP Clustering Results on Random Non-partition Data

        AP (p = median(s))          AP (p = min(s))
N       NC    ET(s)     Sil         NC    ET(s)     Sil
100     10    0.021     0.440       3     0.117     0.420
500     19    0.963     0.374       8     0.531     0.384
1000    27    3.232     0.369       10    2.723     0.371
1500    34    11.019    0.361       12    8.233     0.359
2000    37    16.193    0.358       15    11.022    0.367

N : Number of data        NC : Number of clusters        ET : Execution time
Sil : Silhouette index    FMI : Fowlkes-Mallows index


Table 3. MAAP-DDP Clustering Results on Random Non-partition Data

        MAAP-DDP
N       NC    ET(s)     Sil      Const
100     6     0.048     0.325    1.0
500     8     2.183     0.325    2.0
1000    10    8.734     0.318    2.0
1500    13    18.055    0.325    2.0
2000    15    31.620    0.323    2.0

ISSN: 2088-8708 


Figure 2. (a) Non-partition random data, N = 1000; (b) AP with p = min(s) for non-partition random data,
N = 1000; (c) AP with p = median(s) for non-partition random data, N = 1000; (d) MAAP with
distributed-data-based p for non-partition random data, N = 1000.


those of AP, the FMI values of MAAP-DDP are greater than those of the original AP. This means that MAAP-DDP
is better at recovering the true clustering structure. The key parameter of the MAAP-DDP algorithm is Const
in Eq. 9. The algorithm is designed to adaptively find the Const value that yields the best clustering
solution.

Figure 3. (a) 4-partition random data, N = 1000; (b) AP with p = min(s) for 4-partition random data,
N = 1000; (c) AP with p = median(s) for 4-partition random data, N = 1000; (d) MAAP with
distributed-data-based p for 4-partition random data, N = 1000.


Table 4. AP Clustering Results on Random 4-partition Data

        AP (p = median(s))                   AP (p = min(s))
N       NC    ET(s)    Sil      FMI          NC    ET(s)    Sil      FMI
100     11    0.094    0.358    0.58               0.090    0.368    0.718
300     18    0.477    0.350    0.38               0.502    0.338    0.639
500     27    1.502    0.336    0.35               1.444    0.292    0.556
800     35    4.252    0.348    0.31               4.872    0.317    0.457
1000    36    6.800    0.343    0.31               5.642    0.304    0.474

N : Number of data        NC : Number of clusters        ET : Execution time
Sil : Silhouette index    FMI : Fowlkes-Mallows index




Table 5. MAAP-DDP Clustering Results on Random 4-partition Data

        MAAP-DDP
N       NC    ET(s)    Sil      FMI      Const
100     4     0.164    0.300    0.421    0.029
300     4     0.578    0.224    0.545    6.457
500     4     0.898    0.278    0.438    0.410
800     4     3.641    0.228    0.531    27.554
1000    4     6.312    0.262    0.474    30.850


4.3. Experiments on Real Datasets 


Table 6 and Table 7 show that MAAP successfully identified the clusters of all real datasets, i.e. the
number of clusters equals the number of true clusters (TC), whereas the original AP did not succeed on all
of them. The original AP with p(k) set to the minimum value of s(i, k) succeeded in identifying the clusters
of three real datasets, the exception being the Ionosphere dataset. The key parameter of the MAAP-DDP
algorithm is Const in Eq. 9. The algorithm is designed to adaptively find the Const value that yields the
best clustering solution.

The success of MAAP-DDP is supported by its Sil values, which are all greater than 0.300, and by its FMI
values. The Sil values are as follows: 0.455 for 4k2_far, 0.513 for Ionosphere, 0.522 for Iris, and 0.365
for Wine. The FMI values are as follows: 0.681 for 4k2_far, 0.719 for Ionosphere, 0.679 for Iris, and 0.622
for Wine.


Table 6. AP Clustering Results on Real Datasets

                   AP (p = min(s))                    AP (p = med(s))
Dataset       TC   NC    ET(s)    Sil      FMI        NC    ET(s)    Sil      FMI
4k2_far       4    4     0.790    0.761    1.000      6     0.808    0.450    0.743
Ionosphere    2    4     0.633    0.386    0.575      28    1.517    0.444    0.413
Iris          3    3     0.129    0.541    0.809      6     0.103    0.469    0.724
Wine          3    3     0.165    0.491    0.830      11    0.174    0.314    0.468

TC : True clusters        NC : Number of clusters        ET : Execution time
Sil : Silhouette index    FMI : Fowlkes-Mallows index


Table 7. MAAP-DDP Clustering Results on Real Datasets

                   MAAP-DDP
Dataset       TC   NC    ET(s)    Sil      FMI      Const
4k2_far       4    4     1.318    0.455    0.681    4.879
Ionosphere    2    2     1.228    0.513    0.719    160.200
Iris          3    3     0.771    0.522    0.679    9.252
Wine          3    3     0.795    0.365    0.622    7.559


5. CONCLUSIONS AND FUTURE RESEARCH 

From the above results it can be concluded that: (i) the proposed algorithm, MAAP-DDP, is slower than the
original AP on the random non-partition dataset, and (ii) on the random 4-partition dataset and the real
datasets the proposed algorithm succeeds in identifying clusters in agreement with the number of true labels
in the dataset, with execution times comparable with those of the original AP.

In addition, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure. The
original AAP algorithm stops when the value of C(i) is lower than 2, whereas the MAAP-DDP algorithm
terminates once the best clusters are found. The key parameter of the MAAP-DDP algorithm is Const in Eq. 9.
The algorithm is designed to adaptively find the Const value that yields the best clustering solution.

In the future, to verify the algorithm, we have to test MAAP-DDP on other kinds of datasets, e.g. synthetic
datasets, face-image datasets, and so on.